Abstract

We consider the problems of clustering, classification, and visualization of high-dimensional data
when no straightforward Euclidean representation exists. Typically, these tasks are performed by first
reducing the high-dimensional data to some lower dimensional Euclidean space, and many manifold
learning methods have been developed for this task. In many practical problems, however, the assumption
of a Euclidean manifold cannot be justified. In these cases, a more appropriate assumption would be
that the data lies on a statistical manifold, or a manifold of probability density functions (PDFs). In this
paper we propose using the properties of information geometry in order to define similarities between data
sets using the Fisher information metric. We will show this metric can be approximated using entirely
non-parametric methods, as the parameterization of the manifold is generally unknown. Furthermore,
by using multi-dimensional scaling methods, we are able to embed the corresponding PDFs into a
low-dimensional Euclidean space. This not only allows for classification of the data, but also visualization of
the manifold. As a whole, we refer to our framework as Fisher Information Non-parametric Embedding
(FINE), and illustrate its uses on a variety of practical problems, including bio-medical applications and
document classification.

I. Introduction 

The fields of statistical learning and machine learning are used to study problems of inference, which 
is to say gaining knowledge through the construction of models in order to make decisions or predictions 
based on observed data [1]. Statistical learning examines problems such as observing natural associations 
between data sets (clustering), and predicting to which class of known groupings an unlabeled data set 
belongs (classification), based on some model defined by a priori knowledge of the data. Machine learning 

Acknowledgement: This work is partially funded by the National Science Foundation, grant No. CCR-0325571. 



February 14, 2008 



DRAFT 






introduces a non-parametric approach to these learning tasks via model-free learning from examples. 
Recent work on manifold learning aims at the high dimension regime, in which examples are governed by 
geometrical constraints effectively reducing the dimension of the problem from a high extrinsic dimension 
to a low intrinsic dimension. On the other hand, information geometry aims at understanding the structure 
of statistical models and introduces a geometric perspective to inference problems [2]. 

We are interested in the intersection of these three fields, using the principles of each to solve problems
that do not fit within the framework of any of the individual fields. Often data does not exhibit a low
intrinsic dimension in the data domain as one would have in manifold learning. A straightforward strategy
is to express the data in terms of a low-dimensional feature vector for which the curse of dimensionality
is alleviated. This initial processing of data as real-valued feature vectors in Euclidean space, which is
often carried out in an ad hoc manner, has been called the "dirty laundry" of machine learning [3]. This
procedure is highly dependent on having a good model for the data, and in the absence of such a model
may be highly suboptimal. When a statistical model is available, the process of obtaining a feature vector
can be done optimally by extracting the model parameters for a given data set and thus characterizing
the data through its lower dimensional parameter vector. We are interested in extending this approach to
the case in which the data follows an unknown parametric statistical model.

While the problem of learning in a Euclidean space is well defined, there are many problems in which
the data cannot be appropriately represented by a Euclidean manifold, and the model parameters are
unspecified and must be learned through the data. In flow cytometry, pathologists study blood samples
containing many cells taken from a patient. Each individual cell is analyzed with different fluorescent
markers, resulting in a large, high-dimensional data set. This is assumed to be a realization of some
overriding parametric model, but the model parameters are unknown. Pathologists desire the ability to
appropriately classify patients with differing ailments that may express similar responses to these markers.
For the purposes of analysis and visualization, it is then necessary to reduce the dimensionality of these
sets. The problem of document classification is one in which the data is clearly non-Euclidean, as each
set is a collection of words from a dictionary. It is still desired to distinguish between documents by
forming clusters of different similarities. A standard method is to form a probability distribution over
a dictionary and use methods of information geometry to determine a similarity between data sets [4].
Applications of statistical manifolds have also been presented in the cases of face recognition [5], texture
segmentation [6], image analysis [7], and shape analysis [8].

A common theme to all of the problems presented above is that the model from which the data is
generated is unknown. In this paper, we present a framework to handle such problems. Specifically, we
focus on the case where the data is high-dimensional and no lower dimensional Euclidean manifold
gives a sufficient description. In many of these cases, a lower dimensional statistical manifold can be
used to assess the data for various learning tasks. We refer to our framework as Fisher Information
Non-parametric Embedding (FINE), and it includes characterization of data sets in terms of a non-parametric
statistical model, a geodesic approximation of the Fisher information distance as a metric for evaluating
similarities between data sets, and a dimensionality reduction procedure to obtain a low-dimensional
Euclidean embedding of the original high-dimensional data set for the purposes of both classification and
visualization.

Statistical manifolds in both the parametric and non-parametric settings have been well discussed [9],
[10]. Our work differs in that we assume the manifold is derived from some natural parameterization, only
that set of parameters is unknown. There has been much work presented on the use of statistical manifolds
[4], [7], [11], [12] and information geometry [13], [14] in learning problems, all proposing alternatives
to using Euclidean geometry for data modeling. These methods focus on clustering and classification,
and do not explicitly address the problems of dimensionality reduction (embedding each set into a
low-dimensional Euclidean space) and visualization. Additionally, they focus on parameter estimation as a
necessity for their methods, as opposed to our work which is performed in a non-parametric setting. We
provide a start-to-finish framework which enables analysis of high-dimensional data through non-linear
embedding into a low-dimensional space by information, not Euclidean, geometry. Our methods require
no explicit model assumptions, only that the given data is a realization from an unknown model
with some natural parameterization.

Recent work by Lee et al. [15] similar to our own [16], [17] has demonstrated the use of statistical
manifolds for dimensionality reduction. While each work has been developed independently and originally
presented at nearly the same time, they share enough similarities that we now express the different
contributions of our own work. Specifically, we consider the work presented by Lee et al. to be a
specialized case of our more general framework. They focus on the specific case of image segmentation,
which consists of multinomial distributions as points which lie on an n-simplex (or projected onto an
(n + 1)-dimensional sphere). By framing their problem as such, they are able to exploit the properties of
such a manifold: using the cosine distance as an exact computation of the Fisher information distance,
and using linear methods (PCA) of dimensionality reduction. They have shown very promising results
for the problem of image segmentation, and briefly mention the possibility of using non-linear methods
of dimensionality reduction, which they consider unnecessary for their problem. The work we present
differs in that we make no assumptions on the type of distributions making up the statistical manifold.
As such, our geodesic approximation for the Fisher information accounts for submanifolds of interest.
This is illustrated later in Fig. 3, where the submanifold lies on the (n + 1)-dimensional sphere, but
does not fill the entire space. As such, there is no exact measure of the Fisher information between
points, and we must approximate with a geodesic along the manifold. Additionally, we utilize non-linear
methods of dimensionality reduction, which we consider to be more relevant for many non-linear types of
applications. Finally, by considering all statistical manifolds rather than focusing on those consisting
of multinomial distributions, we are able to apply our methods to many problems of practical interest.

This paper is organized as follows: Section II describes a background in information geometry and
statistical manifolds. Section III gives the formulation for the problem we wish to solve, while Section IV
develops and outlines the FINE algorithm. We illustrate the results of using FINE on real and synthetic
data sets in Section V. Finally, we draw conclusions and discuss the possibilities for future work in
Section VI.

II. Background on Information Geometry 

Information geometry is a field that has emerged from the study of geometrical structures on manifolds
of probability distributions. These investigations analyze probability distributions as geometrical
structures in a Riemannian space. Using tools and methods deriving from differential geometry, information
geometry is applicable to information theory, probability theory, and statistics. The field of information
geometry is largely based on the works of Shun'ichi Amari [18] and has been used for analysis in such
fields as statistical inference, neural networks, and control systems. In this section, we will give a brief
background on the methods of information geometry that we utilize in our framework. For a more
thorough introduction to information geometry, we suggest [19] and [2].

A. Differential Manifolds 

The concept of a differential manifold is similar to that of a smooth curve or surface lying in a
high-dimensional space. A manifold $\mathcal{M}$ can be intuitively thought of as a set of points with a coordinate
system. These points can be from a variety of constructs, such as Euclidean coordinates, linear systems,
images, or probability distributions. Regardless of the definition of the points in the manifold $\mathcal{M}$, there
exists a coordinate system with a one-to-one mapping from $\mathcal{M}$ to $\mathbb{R}^d$, and as such, $d$ is known as the
dimension of $\mathcal{M}$.

For reference, we will refer to the coordinate system on $\mathcal{M}$ as $\psi : \mathcal{M} \to \mathbb{R}^d$. If $\psi$ has $\mathcal{M}$ as its
domain, we call it a global coordinate system [2]. In this situation, $\psi$ is a one-to-one mapping onto $\mathbb{R}^d$
for all points in $\mathcal{M}$. A manifold is differentiable if the coordinate system mapping $\psi$ is differentiable
over its entire domain. If $\psi$ is infinitely differentiable, the manifold is said to be 'smooth' [19].

In many cases there does not exist a global coordinate system. Examples of such manifolds include the
surface of a sphere, the "swiss roll", and the torus. For these manifolds, there are only local coordinate
systems. Intuitively, a local coordinate system acts as a global coordinate system for a local neighborhood
of the manifold, and there may be many local coordinate systems for a particular manifold. Fortunately,
since a local coordinate system contains the same properties as a global coordinate system (only on a
local level), analysis is consistent between the two. As such, we shall focus solely on manifolds with a
global coordinate system.

1) Statistical Manifolds: Let us now present the notion of statistical manifolds, or a set $\mathcal{M}$ whose elements
are probability distributions. A probability distribution function (PDF) on a set $\mathcal{X}$ is defined as a function
$p : \mathcal{X} \to \mathbb{R}$ in which

$$p(x) \ge 0,\ \forall x \in \mathcal{X}, \qquad \int p(x)\,dx = 1. \tag{1}$$

We describe only the case for continuum on the set $\mathcal{X}$; however, if $\mathcal{X}$ is discrete valued, equation (1)
still applies by replacing $\int p(x)\,dx = 1$ with $\sum p(x) = 1$. If we consider $\mathcal{M}$ to be a family of PDFs
on the set $\mathcal{X}$, in which each element of $\mathcal{M}$ is a PDF which can be parameterized by $\theta = [\theta^1, \dots, \theta^n]$,
then $\mathcal{M}$ is known as a statistical model on $\mathcal{X}$. Specifically, let

$$\mathcal{M} = \{p(x \mid \theta) \mid \theta \in \Theta \subseteq \mathbb{R}^n\}, \tag{2}$$

with $p(x \mid \theta)$ satisfying the equations in (1). Additionally, there exists a one-to-one mapping between $\theta$
and $p(x \mid \theta)$.

Given certain properties of the parameterization of $\mathcal{M}$, such as differentiability and $C^\infty$ diffeomorphism
(details of which are described in [2]), the parameterization $\theta$ is also a coordinate system of $\mathcal{M}$. In this
case, $\mathcal{M}$ is known as a statistical manifold. In the rest of this paper, we will use the terms 'manifold'
and 'statistical manifold' interchangeably.

B. Distances on Manifolds 

In Euclidean space, the distance between two points is defined as the length of a straight line between 
the points. On a manifold, however, one can measure distance by a trace of the shortest path between 
the points along the manifold. This path is called a geodesic, and the length of the path is the geodesic 
distance. In information geometry, the distance between two points on a manifold is analogous to the 
difference in information between them, and is defined by the Fisher information metric. 

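1) Fisher Information Metric: The Fisher information measures the amount of information a random
variable $X$ contains in reference to an unknown parameter $\theta$. For the single parameter case it is defined
as

$$\mathcal{I}(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^2 \,\middle|\, \theta\right].$$

If the condition $\int \frac{\partial^2}{\partial\theta^2} f(X;\theta)\,dX = 0$ is met, then the above equation can be written as

$$\mathcal{I}(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right].$$

For the case of multiple parameters $\theta = [\theta^1, \dots, \theta^n]$, we define the Fisher information matrix $[\mathcal{I}(\theta)]$,
whose elements consist of the Fisher information with respect to specified parameters, as

$$\mathcal{I}_{ij} = \int f(X;\theta)\,\frac{\partial \log f(X;\theta)}{\partial\theta^i}\,\frac{\partial \log f(X;\theta)}{\partial\theta^j}\,dX. \tag{3}$$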



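For a parametric family of probability distributions, it is possible to define a Riemannian metric using
the Fisher information matrix, known as the information metric. The information metric distance, or
Fisher information distance, between two distributions $p(x;\theta_1)$ and $p(x;\theta_2)$ in a single parameter family
is

$$D_F(\theta_1,\theta_2) = \int_{\theta_1}^{\theta_2} \mathcal{I}(\theta)^{1/2}\,d\theta, \tag{4}$$

where $\theta_1$ and $\theta_2$ are parameter values corresponding to the two PDFs and $\mathcal{I}(\theta)$ is the Fisher information
for the parameter $\theta$. Extending to the multi-parameter case, we obtain:

$$D_F(\theta_1,\theta_2) = \min_{\theta(\cdot):\,\theta(0)=\theta_1,\,\theta(1)=\theta_2} \int_0^1 \sqrt{\left(\frac{d\theta}{dt}\right)^T [\mathcal{I}(\theta)] \left(\frac{d\theta}{dt}\right)}\,dt. \tag{5}$$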
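As a brief worked illustration of (4), consider the single-parameter family of Gaussian densities with unknown mean $\mu$ and a fixed, known standard deviation $\sigma$. This restriction is an assumption made here only to keep the calculation short. The score is $\frac{\partial}{\partial\mu}\log f(X;\mu) = (X-\mu)/\sigma^2$, so that

$$\mathcal{I}(\mu) = E\left[\frac{(X-\mu)^2}{\sigma^4}\right] = \frac{1}{\sigma^2}, \qquad D_F(\mu_1,\mu_2) = \int_{\mu_1}^{\mu_2} \frac{1}{\sigma}\,d\mu = \frac{\mu_2-\mu_1}{\sigma}, \quad \mu_1 \le \mu_2.$$

Two densities whose means differ by a fixed amount are therefore further apart in information distance when the variance is small, which agrees with the intuition that they overlap less.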

2) Example: Here we present a derivation of a geodesic distance between univariate Gaussian densities
via the Fisher information metric for two reasons. First, we would like to illustrate how involved the
process is for such a simple family of PDFs. Second, we present a process of deriving the Fisher
information metric that is involved in computing the geodesic distance. Let us consider the family of
univariate Gaussian distributions $\mathcal{P} = \{p_1, \dots, p_n\}$, where

$$p_i(x) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right).$$

For the case of $\mathcal{P}$ parameterized by $\theta = (\mu/\sqrt{2}, \sigma)$, the resultant Fisher information matrix is

$$[\mathcal{I}(\theta)] = \frac{2}{\sigma^2}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

We omit the derivation, which can be found in [19] and is straightforward from (3).

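We define the distance between two points on the manifold as the minimum length over all paths
connecting the two points. Using the inner product associated with the Fisher information matrix,

$$\langle u, v \rangle_F = u^T [\mathcal{I}(\theta)]\, v,$$

we define the length of the path $P$ between two points parameterized by $\theta_1$ and $\theta_2$ on the manifold $\mathcal{M}$
as

$$\|\theta_1 - \theta_2\|_P = \sqrt{\langle \theta_1 - \theta_2,\ \theta_1 - \theta_2 \rangle_F}.$$

Using the parameterization $\theta(t)$ such that $\theta(0) = \theta_1$ and $\theta(1) = \theta_2$, we obtain the length of $P$ as

$$\|\theta_1 - \theta_2\|_P = \int_0^1 \sqrt{\left\langle \dot\theta(t), \dot\theta(t) \right\rangle_F}\,dt, \qquad \dot\theta(t) = \frac{d}{dt}\theta(t).$$

We are able to define the distance between points $p_1 = p(x;\theta_1)$ and $p_2 = p(x;\theta_2)$ as the minimum over
all path lengths defined above:

$$D_F(p_1,p_2) = \min_{\theta(t)} \sqrt{2} \int_0^1 \sqrt{\frac{1}{\sigma^2}\left(\frac{\dot\mu^2}{2} + \dot\sigma^2\right)}\,dt, \tag{6}$$

where $\dot\mu = \frac{d}{dt}\mu(t)$ and $\dot\sigma = \frac{d}{dt}\sigma(t)$.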

The solution to (6) is the well known Poincare hyperbolic distance, in which the shortest path between 
two points is the length of an arc on a circle in which both points are at a radius length from the circle's 
center. In the case of the univariate normal distribution, this arc is a straight line when the mean is held 
constant and the variance is changed. 

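By changing variables and parameterizing $\sigma$ as a function of $\mu$, we obtain:

$$D_F(p_1,p_2) = \min_{\sigma(\mu):\,\sigma(\mu_1)=\sigma_1,\ \sigma(\mu_2)=\sigma_2} \sqrt{2} \int_{\mu_1}^{\mu_2} \sqrt{\frac{1}{\sigma^2}\left(\frac{1}{2} + \dot\sigma^2\right)}\,d\mu,$$

where $\dot\sigma = \frac{d}{d\mu}\sigma(\mu)$. It should be clear that this is a representation of (4). It should also be noted that
there exists a one-to-one mapping $\sigma(\mu)$ along the geodesic from $\sigma(\mu_1)$ to $\sigma(\mu_2)$, except for
the case when $\mu_1 = \mu_2$.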

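Solving (6) becomes a problem of calculus of variations. For the univariate normal family of
distributions, this has been calculated in a closed-form expression presented in [20], determining the Fisher
information distance as:

$$D_F(p_1,p_2) = \sqrt{2}\,\log \frac{\left\|\left(\frac{\mu_1}{\sqrt{2}},\sigma_1\right) - \left(\frac{\mu_2}{\sqrt{2}},-\sigma_2\right)\right\| + \left\|\left(\frac{\mu_1}{\sqrt{2}},\sigma_1\right) - \left(\frac{\mu_2}{\sqrt{2}},\sigma_2\right)\right\|}{\left\|\left(\frac{\mu_1}{\sqrt{2}},\sigma_1\right) - \left(\frac{\mu_2}{\sqrt{2}},-\sigma_2\right)\right\| - \left\|\left(\frac{\mu_1}{\sqrt{2}},\sigma_1\right) - \left(\frac{\mu_2}{\sqrt{2}},\sigma_2\right)\right\|}, \tag{7}$$

where $\|\cdot\|$ denotes the Euclidean norm.

For visualization, let us define a set of probability densities $\mathcal{P} = \{p_i(x)\}$ on a grid, such that $p_i = p_{k,l}$
is parameterized by $(\mu_i, \sigma_i) = (\alpha k, 1 + \beta l)$, $k,l = 1 \dots n$ and $\alpha, \beta \in \mathbb{R}$. Figure 1 shows a mesh-grid and
contour plot of the Fisher information distance between the density defined by $(\mu_i, \sigma_i) = (0.6, 1.5)$ and
the neighboring densities on the set $\mathcal{P}$ ($\alpha = \beta = 0.1$).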
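The closed-form expression (7) is straightforward to evaluate numerically. The following is a minimal sketch in Python with NumPy, not the authors' code. The function name `fisher_distance_gaussian` and the grid size `n = 10` are assumptions chosen for illustration, while the grid spacing and the reference density (0.6, 1.5) follow the text.

```python
import numpy as np

def fisher_distance_gaussian(mu1, sigma1, mu2, sigma2):
    """Fisher information distance (7) between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    z1 = np.array([mu1 / np.sqrt(2.0), sigma1])
    z2 = np.array([mu2 / np.sqrt(2.0), sigma2])
    z2_reflected = np.array([mu2 / np.sqrt(2.0), -sigma2])
    a = np.linalg.norm(z1 - z2_reflected)  # distance to the point mirrored in sigma
    b = np.linalg.norm(z1 - z2)            # ordinary Euclidean distance
    return np.sqrt(2.0) * np.log((a + b) / (a - b))

# Grid (mu, sigma) = (alpha * k, 1 + beta * l), k, l = 1..n; n = 10 is an assumed size.
alpha = beta = 0.1
n = 10
mus = alpha * np.arange(1, n + 1)
sigmas = 1.0 + beta * np.arange(1, n + 1)

# Distance from the reference density (0.6, 1.5) to every density on the grid.
distances = np.array([[fisher_distance_gaussian(0.6, 1.5, m, s) for s in sigmas]
                      for m in mus])
```

The array `distances` could then be passed to any surface or contour plotting routine to obtain a picture in the spirit of Figure 1.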

III. Problem Formulation

A key property of the Fisher information metric (FIM) is that it is independent of the parameterization of the
manifold [7], [19]. Although the evaluation remains equivalent, calculating the FIM requires knowledge
of the parameterization, which is generally not available. We instead assume that the collection of density
functions lies on a manifold that can be described by some natural parameterization. Specifically, we are
given $\mathcal{P} = \{p_1, \dots, p_n\}$, where $p_i \in \mathcal{M}$ is a PDF and $\mathcal{M}$ is a manifold embedded in $\mathcal{S}$, the simplex of
densities in $L_1$. Under these circumstances, it is important to note that much of the same theory still applies
for determining dissimilarity between probability distributions. Our goal is to find an approximation for
the geodesic distance between points on $\mathcal{M}$ using only the information available in $\mathcal{P}$. Can we find an
approximation function $G$ which yields

$$\hat{D}_F(p_i,p_j) = G(p_i,p_j;\mathcal{P}), \tag{8}$$

such that $\hat{D}_F(p_i,p_j) \to D_F(p_i,p_j)$ as $n \to \infty$?

This problem is similar to the setting of classical papers [21], [22] in manifold learning and dimen- 
sionality reduction, where only a set of points on the manifold are available. As such, we are able to 
use these manifold learning techniques to construct a low-dimensional embedding of that family. This 
not only allows for an effective visualization of the manifold (in 2 or 3 dimensions), but by reducing 
the effect of the curse of dimensionality we can perform clustering and classification on the family of 
distributions lying on the manifold. 

A. Approximation of Fisher Information Distance 

The Fisher information distance is consistent, regardless of the parameterization of the manifold [7].
This fact enables the approximation of the information distance when the specific parameterization of the
manifold is unknown, and there have been many metrics developed for this approximation. An important
class of such divergences is known as the $f$-divergence [23], in which $f(u)$ is a convex function on
$u > 0$ and

$$D_f(p\|q) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) dx.$$

A specific and important example of the $f$-divergence is the $\alpha$-divergence, where $D^{(\alpha)} = D_{f^{(\alpha)}}$ for a
real number $\alpha$. The function $f^{(\alpha)}(u)$ is defined as

$$f^{(\alpha)}(u) = \begin{cases} \frac{4}{1-\alpha^2}\left(1 - u^{(1+\alpha)/2}\right) & \alpha \neq \pm 1 \\ u \log u & \alpha = 1 \\ -\log u & \alpha = -1. \end{cases}$$

As such, the $\alpha$-divergence can be evaluated as

$$D^{(\alpha)}(p\|q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{\frac{1-\alpha}{2}} q(x)^{\frac{1+\alpha}{2}}\,dx\right), \quad \alpha \neq \pm 1,$$

and

$$D^{(-1)}(p\|q) = D^{(1)}(q\|p) = \int p(x) \log\frac{p(x)}{q(x)}\,dx. \tag{9}$$

The $\alpha$-divergence is the basis for many important and well known divergence metrics, such as the
Hellinger distance, the Kullback-Leibler divergence, and the Rényi-Alpha entropy [24].
1) Kullback-Leibler Divergence: The Kullback-Leibler (KL) divergence is defined as

$$KL(p\|q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx, \tag{10}$$

which is equal to $D^{(-1)}$ in (9). The KL-divergence is a very important measure in information theory, and is
commonly referred to as the relative entropy of one PDF to another. Kass and Vos show [19] that the relation
between the Kullback-Leibler divergence and the Fisher information distance is

$$\sqrt{2KL(p\|q)} \to D_F(p,q)$$

as $p \to q$. This allows for an approximation of the Fisher information distance, through the use of the
available PDFs, without the need for the specific parameterization of the manifold.

Returning to our illustration developed in Section II-B2, we have defined the data set $\mathcal{P}$ of univariate
normal distributions, and presented an expression for the Fisher information distance on the resultant
manifold (7). The Kullback-Leibler divergence between univariate normal distributions is also available
in a closed-form expression:

$$KL(p_i\|p_j) = \frac{1}{2}\left(\log\frac{\sigma_j^2}{\sigma_i^2} + \frac{\sigma_i^2}{\sigma_j^2} + \frac{(\mu_j-\mu_i)^2}{\sigma_j^2} - 1\right).$$



To compare the KL-divergence to the Fisher information distance, we define the error as
$E = \left|\sqrt{2KL(p_i\|p_j)} - D_F(p_i,p_j)\right|$, where $p_i, p_j \in \mathcal{P}$. In Fig. 2 we display the mesh-grid and contour plots of
$E$, where point $p_i$ is held constant in the center of the grid defining $\mathcal{P}$, and $p_j$ varies about the manifold.
As described earlier, as the density $p_j \to p_i$, the error $E \to 0$. In Fig. 2(b), the reference point $p_i$ is
noted by the red star.

Fig. 2. a) Mesh-grid and b) contour plots of the error between the KL-divergence and the Fisher information distance based on
a grid of univariate normal densities, parameterized by $(\mu,\sigma)$. $E = \left|\sqrt{2KL(p_i\|p_j)} - D_F(p_i,p_j)\right|$. Note that $\sqrt{2KL} \to D_F$
as $p_j \to p_i$, where $p_i$ is denoted by the red star.
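The error $E$ can be checked numerically for the univariate normal case, since both the KL-divergence and the Fisher information distance are available in closed form. The sketch below is illustrative only. The function names and the two test points are assumptions introduced here, and the Fisher distance is an algebraically equivalent rewriting of (7).

```python
import numpy as np

def kl_gaussian(mu_i, sigma_i, mu_j, sigma_j):
    """Closed-form KL(p_i || p_j) for univariate normal densities."""
    return 0.5 * (np.log(sigma_j**2 / sigma_i**2) + sigma_i**2 / sigma_j**2
                  + (mu_j - mu_i)**2 / sigma_j**2 - 1.0)

def fisher_distance_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form Fisher information distance (7)."""
    du = (mu1 - mu2) / np.sqrt(2.0)
    a = np.hypot(du, sigma1 + sigma2)  # norm with sigma2 mirrored to -sigma2
    b = np.hypot(du, sigma1 - sigma2)  # plain Euclidean norm
    return np.sqrt(2.0) * np.log((a + b) / (a - b))

def approximation_error(mu_i, sigma_i, mu_j, sigma_j):
    """E = |sqrt(2 KL(p_i || p_j)) - D_F(p_i, p_j)|."""
    kl = max(kl_gaussian(mu_i, sigma_i, mu_j, sigma_j), 0.0)  # guard against rounding
    return abs(np.sqrt(2.0 * kl)
               - fisher_distance_gaussian(mu_i, sigma_i, mu_j, sigma_j))

# Reference density (0.6, 1.5) against a nearby and a more distant density.
error_near = approximation_error(0.6, 1.5, 0.61, 1.5)
error_far = approximation_error(0.6, 1.5, 1.0, 2.0)
```

Comparing `error_near` with `error_far` reproduces the qualitative behaviour described for Fig. 2: the approximation is tight close to the reference point and degrades as the densities move apart.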

It should be noted that the KL-divergence is not a distance metric, as it satisfies neither the symmetry
property, since $KL(p\|q) \neq KL(q\|p)$ in general, nor the triangle inequality. To obtain symmetry,
we define the symmetric KL-divergence as:

$$D_{KL}(p,q) = KL(p\|q) + KL(q\|p), \tag{11}$$

which is symmetric, but still not a distance as it does not satisfy the triangle inequality. Since the Fisher
information is a symmetric measure, we can relate the symmetric KL-divergence and approximate the
Fisher information distance as

$$\sqrt{D_{KL}(p,q)} \to D_F(p,q), \tag{12}$$

as $p \to q$.
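For discrete distributions, the symmetrization in (11) and the local approximation in (12) reduce to a few lines. The following sketch assumes the distributions are given as probability vectors over a common support, with $q > 0$ wherever $p > 0$. The function names and the example vectors are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0  # terms with p = 0 contribute nothing to the sum
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def symmetric_kl(p, q):
    """Symmetric KL-divergence D_KL(p, q) of (11)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def fisher_approx_from_kl(p, q):
    """Local approximation sqrt(D_KL(p, q)) of the Fisher information distance, as in (12)."""
    return float(np.sqrt(symmetric_kl(p, q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
```

Evaluating `kl_divergence(p, q)` and `kl_divergence(q, p)` on this pair gives two different values, while `symmetric_kl` is unchanged when its arguments are swapped.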

2) Hellinger Distance: Another important result of the $\alpha$-divergence is the evaluation with $\alpha = 0$:

$$D^{(0)}(p\|q) = 2\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx,$$

which is closely related to the Hellinger distance,

$$D_H = \sqrt{\tfrac{1}{2} D^{(0)}},$$

which satisfies the axioms of distance: symmetry and the triangle inequality. The Hellinger distance is
related to the information distance in the limit by

$$2D_H(p,q) \to D_F(p,q)$$

as $p \to q$ [19]. We note that the Hellinger distance is related to the Kullback-Leibler divergence:
combining the limit above with $\sqrt{2KL(p\|q)} \to D_F(p,q)$ gives, in the limit, $\sqrt{KL(p\|q)/2} \to D_H(p,q)$.

3) Other Fisher Approximations: There are other metrics which approximate the Fisher information
distance, such as the cosine distance. When dealing with multinomial distributions, the approximation

$$D_C(p,q) = 2\arccos\int \sqrt{p(x)\,q(x)}\,dx \to D_F(p,q)$$

is the natural metric on the sphere.

We restrict our analysis to that of the Kullback-Leibler divergence and the Hellinger distance. The
KL-divergence is a great means of differentiating shapes of continuous PDFs. Analysis of (10) shows that
as $p(x)/q(x) \to \infty$, $KL(p\|q) \to \infty$. These properties ensure that the KL-divergence will be amplified
in regions where there is a significant difference in the probability distributions. This cannot be used
in the case of a multinomial PDF, however, because of divide-by-zero issues. In that case the Hellinger
distance is the desired metric as there exists a monotonic transformation function $\psi : D_H \to D_C$ [19].
For additional measures of probabilistic distance, some of which approximate the Fisher information
distance, and a means of calculating them between data sets, we refer the reader to [25].

B. Approximation of Distance on Statistical Manifolds

We have shown the approximation function $\hat{D}_F(p_1,p_2)$ of the Fisher information distance between $p_1$
and $p_2$ can be calculated using a variety of metrics as $p_1 \to p_2$. If $p_1$ and $p_2$ do not lie closely together on
the manifold, these approximations become weak. An example of this is illustrated in Fig. 3, where the
manifold of interest lies in a subspace of another manifold, and the distance between two points should
be considered as the distance traveled on the manifold of interest. A good approximation can still be
achieved if the manifold is densely sampled between the two end points. By defining the path between
$p_1$ and $p_2$ as a series of connected segments and summing the length of those segments, we approximate
the distance of the geodesic, which is the shortest path along the manifold. Specifically, given the set of
$n$ PDFs parameterized by $\mathcal{P}_\theta = \{\theta_1, \dots, \theta_n\}$, the Fisher information distance between $p_1$ and $p_2$ can be
estimated as:

$$D_F(p_1,p_2) \approx \min_{m,\{\theta_{(i)}\}} \sum_{i=1}^{m} D_F\left(p(\theta_{(i)}),\, p(\theta_{(i+1)})\right), \quad p(\theta_{(i)}) \to p(\theta_{(i+1)})\ \forall i,$$

where the path runs from $p(\theta_{(1)}) = p_1$ to $p(\theta_{(m+1)}) = p_2$ through points of $\mathcal{P}_\theta$.

Fig. 3. The Fisher information distance between points cannot be exactly calculated about a manifold if the data exists on a
submanifold of interest (shaded area). Rather than directly calculating the distance between points (A), the distance should be
approximated by a geodesic along the submanifold (B).



Using our approximation of the Fisher information distance as $p_1 \to p_2$ (whether KL-divergence or
Hellinger distance is of no immediate concern), we can now define an approximation function $G$ for all
pairs of PDFs:

$$G(p_1,p_2;\mathcal{P}) = \min_{m,\{p_{(i)}\}} \sum_{i=1}^{m} \hat{D}_F\left(p_{(i)},\, p_{(i+1)}\right), \quad p_{(i)} \to p_{(i+1)}\ \forall i, \tag{13}$$

where $\mathcal{P} = \{p_1, \dots, p_n\}$ is the available collection of PDFs on the manifold. Intuitively, this estimate
calculates the length of the shortest path between points in a connected graph on the well sampled
manifold, and as such $G(p_1,p_2;\mathcal{P}) \to D_F(p_1,p_2)$ as $n \to \infty$. This is similar to the manner in which
Isomap [21] approximates distances on Euclidean manifolds. Figure 4 illustrates this approximation
by comparing the KL graph approximation to the actual Fisher information distance for the univariate
Gaussian case. As the manifold is more densely sampled (uniformly in mean and variance parameters
for this simulation), the approximation converges to the true Fisher information distance, as calculated
in (7).
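The estimate in (13) amounts to a shortest-path computation on a graph whose edge weights are local divergence approximations. The sketch below is one possible realization under stated assumptions, not the authors' implementation: edges are restricted to a symmetric k-nearest-neighbour graph (as is common in Isomap-style methods), shortest paths are found with a plain Floyd-Warshall recursion, and the toy data set of equal-mean normal densities is hypothetical.

```python
import numpy as np

def local_distance(mu_i, sigma_i, mu_j, sigma_j):
    """sqrt of the symmetric KL-divergence (12) between two univariate normals."""
    kl_ij = 0.5 * (np.log(sigma_j**2 / sigma_i**2) + sigma_i**2 / sigma_j**2
                   + (mu_j - mu_i)**2 / sigma_j**2 - 1.0)
    kl_ji = 0.5 * (np.log(sigma_i**2 / sigma_j**2) + sigma_j**2 / sigma_i**2
                   + (mu_i - mu_j)**2 / sigma_i**2 - 1.0)
    return np.sqrt(max(kl_ij + kl_ji, 0.0))

def geodesic_approximation(local, k):
    """Shortest-path estimate G over a symmetric k-nearest-neighbour graph."""
    n = local.shape[0]
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        neighbours = [j for j in np.argsort(local[i]) if j != i][:k]
        for j in neighbours:
            graph[i, j] = graph[j, i] = local[i, j]
    for m in range(n):  # Floyd-Warshall relaxation through intermediate point m
        graph = np.minimum(graph, graph[:, m:m + 1] + graph[m:m + 1, :])
    return graph

# Assumed toy data: fixed mean, standard deviations sampled between 1 and 3.
sigmas = np.linspace(1.0, 3.0, 21)
local = np.array([[local_distance(0.0, si, 0.0, sj) for sj in sigmas]
                  for si in sigmas])
G = geodesic_approximation(local, k=2)

direct = local[0, -1]                       # one-step approximation between the end points
graph_estimate = G[0, -1]                   # sum of short segments along the manifold
true_distance = np.sqrt(2.0) * np.log(3.0)  # exact D_F from (7) for equal means
```

On this toy set the one-step value `direct` overshoots `true_distance` noticeably, whereas `graph_estimate` lies much closer to it, which is the behaviour the convergence argument predicts.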

C. Dimensionality Reduction 

Given a matrix of dissimilarities between entities, many algorithms have been developed to find a
low-dimensional embedding of the original data, $\psi : \mathcal{M} \to \mathbb{R}^d$. These techniques have been classified as
a group of methods called Multi-Dimensional Scaling (MDS). There are supervised methods, which are
generally used for classification purposes, and unsupervised methods, which are often used for clustering
and manifold learning. Using these MDS methods allows us to find a single low-dimensional coordinate
representation of each high-dimensional, large sample, data set.

Fig. 4. Convergence of the graph approximation of the Fisher information distance using the Kullback-Leibler divergence. As
the manifold is more densely sampled, the approximation approaches the true value.

1) Classical Multi-Dimensional Scaling: Classical MDS (cMDS) takes a matrix of dissimilarities and
embeds each point into a Euclidean space. This is performed by first centering the dissimilarities about the
origin, then calculating the eigenvalue decomposition of the centered matrix. This unsupervised method
permits the calculation of the low-dimensional embedding coordinates which reveal any natural separation
or clustering of the data.

Define $D$ as a dissimilarity matrix which contains (or approximates) Euclidean distances. Let $B$ be
the "double centered" matrix which is calculated by taking the matrix of squared dissimilarities $D^2$
(squared element-wise), subtracting its row and column means, then adding back the grand mean and
multiplying by $-\frac{1}{2}$. As a result, $B$ is a version of $D$ centered about the origin. Mathematically, this
process is solved by

$$B = -\frac{1}{2} H D^2 H,$$

where $H = I - (1/N)\mathbf{1}\mathbf{1}^T$, $I$ is the $N$-dimensional identity matrix, and $\mathbf{1}$ is an $N$-element vector of
ones.

The embedding coordinates, $Y \in \mathbb{R}^{d \times N}$, can then be determined by taking the eigenvalue decomposition
of $B$,

$$B = [V_1 V_2]\,\mathrm{diag}(\lambda_1, \dots, \lambda_N)\,[V_1 V_2]^T,$$

and calculating

$$Y = \mathrm{diag}\left(\lambda_1^{1/2}, \dots, \lambda_d^{1/2}\right) V_1^T.$$

The matrix $V_1$ consists of the eigenvectors corresponding to the $d$ largest eigenvalues $\lambda_1, \dots, \lambda_d$ while
the remaining $N - d$ eigenvectors are represented as $V_2$. The term 'diag$(\lambda_1, \dots, \lambda_N)$' refers to an $N \times N$
diagonal matrix with $\lambda_i$ as its $i$-th diagonal element.

Fig. 5. Classical MDS applied to the matrix of a) Fisher information distances and b) Kullback-Leibler geodesic approximations of
the Fisher information distance, on a grid of univariate normal densities, parameterized by $(\mu, \sigma)$.

To continue our illustration from Section II-B2, let $D$ be the matrix of Fisher information distances
defined in (7) for the set of univariate normal densities $\mathcal{P}$, where $D(i,j) = D_F(p_i,p_j)$. Figure 5(a) displays
the results of applying cMDS to $D$. We demonstrate the embedding with the geodesic approximation of
the Fisher information distance (13) in Fig. 5(b), which is very similar to the embedding created with
the exact values. It is clear that while the densities defining the set $\mathcal{P}$ are parameterized on a rectangular
grid, the manifold on which $\mathcal{P}$ lives is not rectangular itself, which is due to the differing effects that
changes in mean and variance have on the Gaussian PDF.
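The cMDS procedure described above maps directly onto a few linear algebra calls. The following is a minimal NumPy sketch. The function name `classical_mds` and the random planar test points are assumptions for illustration, and negative eigenvalues (which can arise when $D$ is not exactly Euclidean) are clipped to zero as a practical safeguard that the text does not discuss.

```python
import numpy as np

def classical_mds(D, d):
    """Embed N points into R^d from an N x N dissimilarity matrix D; returns Y of shape (d, N)."""
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix H = I - (1/N) 1 1^T
    B = -0.5 * H @ (D ** 2) @ H              # double-centered squared dissimilarities
    eigvals, eigvecs = np.linalg.eigh(B)     # eigh returns eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]      # indices of the d largest eigenvalues
    lam = np.clip(eigvals[top], 0.0, None)   # guard against small negative values
    return np.diag(np.sqrt(lam)) @ eigvecs[:, top].T

# Assumed toy check: planar points are recovered up to rotation and reflection.
rng = np.random.default_rng(0)
points = rng.normal(size=(8, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
Y = classical_mds(D, d=2)
D_embedded = np.linalg.norm(Y.T[:, None, :] - Y.T[None, :, :], axis=-1)
```

In the FINE setting, `D` would instead hold the approximated Fisher information distances between PDFs, so the embedding is no longer exact and the clipping step becomes relevant.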

2) Laplacian Eigenmaps: Laplacian Eigenmaps (LEM) is an unsupervised technique developed by
Belkin and Niyogi and first presented in [22]. This performs non-linear dimensionality reduction by
performing an eigenvalue decomposition on the graph Laplacian formed by the data. As such, this algorithm
is able to discern low-dimensional structure in high-dimensional spaces that were previously indiscernible
with methods such as principal components analysis (PCA) and classical MDS. The algorithm contains
three steps and works as follows:

1) Construct adjacency graph

Given dissimilarity matrix $D_X$ between data points in the set $X$, define the graph $G$ over all data
points by adding an edge between points $i$ and $j$ if $X_i$ is one of the $k$-nearest neighbors of $X_j$.

2) Compute weight matrix $W$

If points $i$ and $j$ are connected, assign $W_{ij} = e^{-D_X(i,j)^2/t}$ for a kernel width $t$, otherwise $W_{ij} = 0$.

3) Construct low-dimensional embedding

Solve the generalized eigenvalue problem

$$Lf = \lambda Df,$$

where $D$ is the diagonal weight matrix in which $D_{ii} = \sum_j W_{ji}$ and $L = D - W$ is the Laplacian
matrix. If $[f_1, \dots, f_d]$ is the collection of eigenvectors associated with the $d$ smallest generalized
eigenvalues which solve the above, the $d$-dimensional embedding is defined by $y_i = (v_{i1}, \dots, v_{id})^T$,
$1 \le i \le n$, where $v_{ik}$ is the $i$-th element of $f_k$.
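The three steps above can be sketched with NumPy alone. This is an illustrative sketch under several assumptions rather than a reference implementation: the heat-kernel weight $e^{-D_X(i,j)^2/t}$ and its width `t` follow the usual Belkin-Niyogi form, the generalized problem is solved through the symmetrically normalized Laplacian, and the trivial constant eigenvector (eigenvalue zero) is skipped, as is customary. The function name and the toy data on a line are hypothetical.

```python
import numpy as np

def laplacian_eigenmaps(D_X, d, k, t):
    """Return an n x d embedding from an n x n dissimilarity matrix D_X."""
    n = D_X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        # Steps 1 and 2: connect i to its k nearest neighbours with heat-kernel weights.
        neighbours = [j for j in np.argsort(D_X[i]) if j != i][:k]
        for j in neighbours:
            W[i, j] = W[j, i] = np.exp(-D_X[i, j] ** 2 / t)
    deg = W.sum(axis=1)                      # diagonal of the weight matrix D
    L = np.diag(deg) - W                     # graph Laplacian L = D - W
    # Step 3: L f = lambda D f is equivalent to an ordinary symmetric problem
    # for D^{-1/2} L D^{-1/2} with f = D^{-1/2} g.
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym) # ascending eigenvalues
    F = d_inv_sqrt[:, None] * eigvecs        # generalized eigenvectors as columns
    return F[:, 1:d + 1]                     # skip the trivial constant eigenvector

# Assumed toy data: points on a line, which should embed in their original order.
x = np.linspace(0.0, 1.0, 20)
D_X = np.abs(x[:, None] - x[None, :])
Y = laplacian_eigenmaps(D_X, d=1, k=2, t=1.0)
```

Row `i` of `Y` plays the role of $y_i$ in the text. The sketch assumes the neighbourhood graph is connected, which holds for this toy example but should be checked on real data.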

3) Additional MDS Methods: While we choose to only detail the cMDS and LEM algorithms, there are 
many other methods for performing dimensionality reduction in a linear fashion (PCA) and non-linearly
(Local Linear Embedding [26]) for unsupervised learning. For supervised learning there are also linear 
(Linear Discriminant Analysis) and non-linear (Classification Constrained Dimensionality Reduction [27], 
Neighbourhood Component Analysis [28]) methods, all of which can be applied to our framework. We 
do not highlight the heavily utilized Isomap [21] algorithm since it is identical to using cMDS on the 
approximation of the geodesic distances. 

IV. Our Techniques 

We have presented a series of methods for manifold learning developed in the field of information 
geometry. By performing dimensionality reduction on a family of data sets, we are able to both better visualize
and classify the data. In order to obtain a lower dimensional embedding, we calculate a dissimilarity
metric between data sets within the family by approximating the Fisher information distance between
their corresponding PDFs. This has been illustrated with the family of univariate normal probability
distributions. 

In problems of practical interest, however, the parameterization of the probability densities is usually
unknown. We instead are given a family of data sets $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$, in which we may
assume that each data set $X_i$ is a realization of some underlying probability distribution whose
parameters are unknown to us. As such, we rely on non-parametric techniques to estimate
both the probability density and the approximation of the Fisher information distance. Following these 
approximations, we are able to perform the same multi-dimensional scaling operations as previously 
described. 

A. Kernel Density Estimation 

Kernel methods are non-parametric techniques used for estimating probability densities of data sets. 
These methods are similar to mixture-models in that they are defined by the normalized sum of multiple 
densities. Unlike mixture models, however, kernel methods are non-parametric and are comprised of the 
normalized sum of identical densities centered about each data point within the set (14). This yields a
density estimate for the entire set in that highly probable regions will have more samples, and the sum
of the kernels in those areas will be large, corresponding to a high probability in the resultant density.
The kernel density estimate (KDE) of a PDF is defined as

$$\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right), \qquad (14)$$

where $n$ is the number of samples $x_i$ in the data set, $K$ is some kernel satisfying the properties

$$K(x) \ge 0 \;\; \forall x \in X, \qquad \int K(x)\,dx = 1,$$

and $h$ is the bandwidth or smoothing parameter.

There are two key points to note when using kernel density estimators. First, it is necessary to determine 
which distribution to use as the kernel. Without a priori knowledge of the original distribution, we choose 
to use Gaussian kernels, 

$$K(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right), \qquad (15)$$

where $d$ is the dimension of $x$ and $\Sigma$ is the covariance matrix, as they have the quadratic properties that will
be useful in implementation. Secondly, the bandwidth parameter is very important to the overall density
estimate. Choosing a bandwidth parameter too small will yield a peak-filled density, while a bandwidth
that is too large will generate a density estimate that is too smooth and loses most of the features of the 
distribution. There has been much research done in calculating optimal bandwidth parameters, resulting 
in many different methods [29], [30] which can be used in our framework. 
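
To make the effect of the bandwidth concrete, the following is a minimal Python/NumPy sketch of a Gaussian kernel density estimator. It is an illustration under stated assumptions, not the estimator used in our experiments: the name `kde_gaussian` is hypothetical, the kernel is taken to be isotropic (identity covariance, with the bandwidth $h$ applied equally in every dimension), and the normalization $1/(nh)$ of (14) is therefore written as $1/(n h^d)$ for $d$-dimensional data. No bandwidth selection rule from [29], [30] is implemented.

```python
import numpy as np

def kde_gaussian(samples, h):
    """Return a function p_hat(x) built from samples of shape (n, d).

    Assumption: isotropic Gaussian kernel with bandwidth h in every dimension.
    """
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    # Normalising constant: n * h^d * (2*pi)^(d/2).
    norm = n * (2.0 * np.pi) ** (d / 2.0) * h ** d

    def p_hat(x):
        x = np.asarray(x, dtype=float)               # query points, shape (m, d)
        diff = x[:, None, :] - samples[None, :, :]   # (m, n, d)
        sq = np.sum(diff ** 2, axis=2) / h ** 2      # squared scaled distances
        return np.exp(-0.5 * sq).sum(axis=1) / norm  # density at each query point

    return p_hat
```

Evaluating the estimate on a grid for a small and a large value of `h` reproduces the behavior described above: the small bandwidth gives a spiky estimate with a higher maximum, and the large bandwidth gives an over-smoothed one.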

We note that the mean squared error (MSE) of a KDE decreases only as $n^{-O(1/d)}$ in the number of samples $n$,
which becomes extremely slow for large $d$. As such, it may be difficult to calculate good kernel density estimates. However, for
our purposes, the estimation of densities is secondary to the estimation of the divergence between them.
As such, the issues with the MSE of density estimates in large dimensions, while an area for future work,
are not of immediate concern.

B. Algorithm 

Fisher Information Non-parametric Embedding (FINE) is presented in Algorithm 1 and combines all
of the methods we have presented in order to find a low-dimensional embedding of a collection of data
sets. If we assume each data set is a realization of an underlying PDF, and each of those distributions lies
on a manifold with some natural parameterization, then this embedding can be viewed as an embedding
of the actual manifold into Euclidean space. Note that in line 5, 'embed(G, d)' refers to using any
multi-dimensional scaling method (such as cMDS, Laplacian Eigenmaps, etc.) to embed the dissimilarity matrix
$G$ into a Euclidean space with dimension $d$.

Algorithm 1 Fisher Information Non-parametric Embedding

Input: Collection of data sets $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$; the desired embedding dimension $d$
1: for $i = 1$ to $N$ do
2:   Calculate $\hat{p}_i(x)$, the density estimate of $X_i$
3: end for
4: Calculate $G$, where $G(i,j) = \hat{D}_F(p_i, p_j)$, the geodesic approximation of the Fisher information distance
5: $Y$ = embed($G$, $d$)
Output: $d$-dimensional embedding of $\mathcal{X}$ into Euclidean space, $Y \in \mathbb{R}^{d \times N}$
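
The overall flow of Algorithm 1 can be sketched end to end in Python with NumPy. This is a hedged illustration only, and several ingredients are assumptions of the sketch rather than statements about our implementation: the helper names (`kde_logpdf`, `symmetric_kl`, `geodesic`, `cmds`, `fine`), the isotropic Gaussian KDE with a fixed bandwidth `h`, the use of the square root of a sample-averaged symmetric Kullback-Leibler divergence as the local dissimilarity, the shortest-path computation on a symmetrized $k$-nearest-neighbor graph as the geodesic approximation, and classical MDS as the choice of 'embed'. The approximations defined earlier in the paper may differ in detail.

```python
import numpy as np

def kde_logpdf(samples, h):
    """Log of an isotropic Gaussian KDE built on samples of shape (n, d)."""
    n, d = samples.shape
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi) + d * np.log(h)

    def logp(x):
        sq = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2) / h ** 2
        a = -0.5 * sq
        m = a.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
        return m[:, 0] + np.log(np.exp(a - m).sum(axis=1)) - log_norm

    return logp

def symmetric_kl(Xi, Xj, logp_i, logp_j):
    """Sample-average estimate of KL(p_i || p_j) + KL(p_j || p_i)."""
    kl_ij = np.mean(logp_i(Xi) - logp_j(Xi))
    kl_ji = np.mean(logp_j(Xj) - logp_i(Xj))
    return max(kl_ij + kl_ji, 0.0)  # clip small negative estimates

def geodesic(D, k):
    """Shortest-path distances on the symmetrised k-NN graph of D."""
    n = D.shape[0]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]  # skip the point itself
        G[i, nn] = D[i, nn]
        G[nn, i] = D[i, nn]
    for m in range(n):                  # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return G

def cmds(G, d):
    """Classical MDS: double-centre squared distances, keep top-d eigenpairs."""
    n = G.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:d]
    return (evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))).T  # (d, N)

def fine(datasets, d, h=0.5, k=3):
    """Sketch of Algorithm 1 for a list of (n_i, dim) sample arrays."""
    logps = [kde_logpdf(X, h) for X in datasets]           # lines 1-3
    N = len(datasets)
    D = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            D[i, j] = D[j, i] = np.sqrt(
                symmetric_kl(datasets[i], datasets[j], logps[i], logps[j]))
    G = geodesic(D, k)                                      # line 4
    return cmds(G, d)                                       # line 5
```

The neighborhood size `k` must be large enough for the graph to be connected; otherwise some shortest-path distances are infinite and the embedding is undefined. Any other multi-dimensional scaling method could replace `cmds` in the last step.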

V. Applications 

We have illustrated the uses of the presented framework in the previous sections with a manifold 
consisting of the set of univariate normal densities, $\mathcal{P}$. We now present several synthetic and practical
applications for the framework, all of which are based around visualization and classification. In
each application, the densities are unknown, but we assume they lie on a manifold with some natural 
parameterization. 

A. Simulated Data 

To demonstrate the ability of our methods to reconstruct the statistical manifold, we create a known 
manifold of densities. Let $Y = \{y_1, \dots, y_N\}$, where each $y_i$ is uniformly sampled on the 'swiss roll'

Fig. 7. Historically, the process of clinical flow cytometry analysis relies on a series of 2-dimensional scatter plots in which
cell populations are selected for further evaluation. This process does not take advantage of the multi-dimensional nature of the 
problem. 



manifold (see Fig. 6(a)). Let $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$ where each $X_i$ is generated from a normal
distribution $\mathcal{N}(y_i, \Sigma)$, where $\Sigma$ is held constant for each density. As such, we have developed a statistical
manifold of known parameterization, which is sampled by known PDFs. Utilizing FINE in an unsuper- 
vised manner, we are able to recreate the original manifold Y strictly from the collection of data sets 
X. This is shown in Fig. 6(b) where each set is embedded into 3 cMDS dimensions, and the 'swiss roll' 
is reconstructed. While this embedding could easily be constructed using the mean of each set Xi as a 
Euclidean location, it illustrates that FINE can be used for visualizing the statistical manifold as well, 
without a priori knowledge of the data. 
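
A data collection of this kind can be generated with a few lines of Python and NumPy. The sketch below is illustrative only: the function name `swiss_roll_datasets`, the particular roll parameterization, the sample sizes, and the isotropic covariance $\Sigma = \sigma^2 I$ are assumptions, and sampling uniformly in the roll's two parameters is only approximately uniform on the surface itself.

```python
import numpy as np

def swiss_roll_datasets(N=100, n=200, sigma=0.5, seed=0):
    """Draw N centres y_i on a swiss roll and n Gaussian samples around each.

    Returns (Y, X): Y has shape (N, 3); X is a list of N arrays of shape (n, 3).
    """
    rng = np.random.default_rng(seed)
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=N)   # angle along the roll
    height = rng.uniform(0.0, 20.0, size=N)             # position across the roll
    Y = np.column_stack([t * np.cos(t), height, t * np.sin(t)])
    cov = sigma ** 2 * np.eye(3)                         # Sigma, shared by all densities
    X = [rng.multivariate_normal(y, cov, size=n) for y in Y]
    return Y, X
```

Each `X[i]` then plays the role of one data set $X_i$, and the centers `Y` are used only to check how well an embedding recovers the original roll.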

B. Flow Cytometry 

In clinical flow cytometry, cellular suspensions are prepared from patient samples (blood, bone marrow, 
and solid tissue), and evaluated simultaneously for the presence of several expressed surface antigens and 
for characteristic patterns of light scatter as the cells pass through an interrogating laser. Antibodies to each 
target antigen are conjugated to fluorescent markers, and each individual cell is evaluated via detection 
of the fluorescent signal from each marker. The result is a characteristic multi-dimensional distribution 
that, depending on the panel of markers selected, may be distinct for a specific disease entity. The data 
from clinical flow cytometry can be considered multi-dimensional both from the standpoint of multiple 
characteristics measured for each cell, and from the standpoint of thousands of cells analyzed per sample. 
Nonetheless, clinical pathologists generally interpret clinical flow cytometry results in the form of two- 
dimensional scatter plots in which the axes each represent one of multiple cell characteristics analyzed (up 
to 8 parameters per cell in routine clinical flow cytometry, and many more parameters per cell in research 

Fig. 8. 2-dimensional plots of disease classes CLL and MCL. The overlapping nature of the scatter plots makes it difficult for
pathologists to differentiate disease classes using primitive 2-dimensional axes projections. 



applications). Additional parameters are often utilized to "gate" (i.e. select or exclude) specific cell sets 
based on antigen expression or light scatter characteristics; however, clinical flow cytometry analysis 
remains a step-by-step process of 2-dimensional histogram analysis (Fig. 7), and the multidimensional 
nature of flow cytometry is routinely underutilized in clinical practice. 

An example of the difficulty in analysis of 2-dimensional scatter plots is illustrated in Fig. 8. Two 
distinct disease classes, mantle cell lymphoma (MCL) and chronic lymphocytic leukemia (CLL), are 
illustrated with both scatter and contour plots. Each point represents a distinct blood cell from one of two
patients, each diagnosed with one of the specified diseases; the axes represent the two markers which pathologists have
determined to be most differentiating for these two disease classes. It is clear
that for these two patients there is significant similarity in the scatter and contour plots of the data. The
overlapping nature of these 2-dimensional scatter plots leads to a very primitive analysis of the available
data. It would be potentially beneficial, therefore, to develop systems for clustering and classification 
of clinical flow cytometry data that utilize all dimensions of data derived for each cell during routine 
clinical analysis. The variability of distributions of data in multidimensional flow cytometry over various 
patients is smaller than that associated with a general characterization of a multivariate distribution. 
This leads us to believe that these distributions exist on some manifold with a much lower dimensional 
parameterization. Hence, we should be able to use FINE for the purpose of viewing a natural clustering 
of different patients into their respective disease classes based on the full set of markers evaluated in 
each multiparameter flow cytometric analysis. 

For this analysis, we will compare patients with two distinct but immunophenotypically similar forms 
of lymphoid leukemia - mantle cell lymphoma (MCL) and chronic lymphocytic leukemia (CLL), as 

Fig. 9. 2-dimensional embedding of CLL (•) and MCL (+) patients using FINE with cMDS and the Kullback-Leibler divergence
as a dissimilarity metric. The circled points correspond to the CLL and MCL cases highlighted in Fig. 8, which are difficult to 
discern with scatter plots, but well separated in the FINE space. 



illustrated in Fig. 8. These diseases display similar characteristics with respect to many expressed surface 
antigens, but are generally distinct in their patterns of expression of two common B lymphocyte antigens 
CD23 and FMC7 (a distinct conformational epitope of the CD20 antigen). Typically, CLL is positive 
for expression of CD23 and negative for expression of FMC7, while MCL is positive for expression of 
FMC7 and negative for expression of CD23. These distinctions should lead to a difference in densities 
between patients in each disease class, and should show a natural clustering. 

Let $\mathcal{X} = \{X_1, X_2, \dots, X_N\}$ where $X_i$ is the data set corresponding to the flow cytometer output
of the $i$-th patient. Each patient's blood is analyzed for 5 parameters: forward and side light scatter,
and 3 fluorescent markers (CD45, CD23, FMC7). Hence, each data set $X_i$ is 5-dimensional with $n_i$
elements corresponding to individual blood cells (each $n_i$ may be different). Given that $\mathcal{X}$ is comprised
of both patients with CLL and patients with MCL, we wish to analyze the performance of FINE for the 
visualization and clustering of cytometry data. 

The data set consists of 23 patients with CLL and 20 patients with MCL. The set Xi for each patient 
is on the order of 5000 cells. The data and clinical diagnosis for each patient were provided by the
Department of Pathology at the University of Michigan. Figure 9 shows the 2-dimensional embedding 
with FINE, using cMDS and the Kullback-Leibler divergence set as the dissimilarity metric. Each point 
in the plot represents an individual patient. Although the discussed methods perform the dimensionality
reduction and embedding in an unsupervised manner, we display the class labels as a means of analysis.
It should be noted that there exists a natural separation between the different classes. As such, we can
conclude that there is a natural difference in probability distribution between the disease classes as well.
Although this is known through years of clinical experience, we were able to determine this without any
a priori knowledge, simply with a density analysis.

An important byproduct of this natural clustering is the ability to visualize the cytometry data in a
manner which allows comparisons between patients. The circled points in Fig. 9 correspond to the patients 
illustrated in Fig. 8, which were difficult to differentiate by using a scatter plot of the most discerning 
marker combination as deemed by pathologists. In the space defined by FINE, the patients are easily 
differentiated and lie well within the clusters of each disease type. By using the embedding created with 
FINE, pathologists are able to determine similarities between patients, which gives them a quick and easy 
means of determining which data sets may need further investigation (i.e. for possible misdiagnosis). 

C. Document Classification 

Recent work has shown interest in using dimensionality reduction for the purposes of document
classification [31] and visualization [32]. Typically documents are represented as very high-dimensional
PDFs, and learning algorithms suffer from the curse of dimensionality. Dimensionality reduction not only
alleviates these concerns, but it also reduces the computational complexity of learning algorithms due to
the resultant low-dimensional space. As such, the problem of document classification is an interesting
application for FINE.

Given a collection of documents of known class, we wish to best classify a document of unknown 
class. A document can be viewed as a realization of some overriding probability distribution, in which
different distributions will create different documents. For example, in a newsgroup about computers you 
could expect to see multiple instances of the term "laptop", while a group discussing recreation may see 
many occurrences of "sports". The counts of "laptop" in the recreation group, or "sports" in the computer 
group would predictably be low. As such, the distributions between articles in computers and recreation 
should be distinct. In this setting, we define the PDFs as the term frequency representation of each
document. Specifically, let $x_i$ be the number of times term $i$ appears in a specific document. The PDF
of that document can then be characterized as the multinomial distribution of normalized word counts,
with the maximum likelihood estimate provided as

$$\hat{p}(x) = \left(\frac{x_1}{\sum_i x_i}, \dots, \frac{x_n}{\sum_i x_i}\right). \qquad (16)$$

By utilizing the term frequencies as a multinomial distribution, and not implementing a kernel density 
estimator, we show that our methods are not tied to the KDE, but we simply use it in the case of 
continuous densities as a means of estimation. If one has a priori knowledge of the distribution, that 
step is unnecessary. Additionally, we use the Hellinger distance due to the multinomial nature of the 
distribution. As described in Section III-A3, $D_H$ has a monotonic transformation to $D_C$, which is the
natural metric on the sphere defined by multinomial PDFs. 
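
The term-frequency estimate and the two distances can be illustrated with a short Python sketch using only the standard library. The function names and the toy documents are hypothetical, and the specific conventions are assumptions of the sketch, since the definitions are given in Section III-A3 and not repeated here: $D_H$ is taken as the Euclidean distance between the square roots of the two PDFs, and $D_C$ as twice the arccosine of their Bhattacharyya coefficient. Under these conventions the monotonic relation is $D_H = 2 \sin(D_C / 4)$.

```python
import math
from collections import Counter

def term_frequency_pdf(tokens, vocabulary):
    """Multinomial MLE: normalised counts of each vocabulary term."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary)  # must be non-zero
    return [counts[w] / total for w in vocabulary]

def hellinger(p, q):
    """Assumed convention: D_H = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def sphere_distance(p, q):
    """Assumed convention: D_C = 2 * arccos(sum_i sqrt(p_i * q_i))."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 2.0 * math.acos(min(1.0, bc))  # clamp guards against rounding above 1

# Toy example with a three-term vocabulary (illustrative data only).
vocabulary = ["laptop", "sports", "game"]
p_comp = term_frequency_pdf("laptop laptop laptop game".split(), vocabulary)
p_rec = term_frequency_pdf("sports sports game game".split(), vocabulary)
d_h = hellinger(p_comp, p_rec)
d_c = sphere_distance(p_comp, p_rec)
```

Because the map from $D_C$ to $D_H$ is monotonic on the range of interest, nearest-neighbor relations between documents are identical under either distance.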

Fig. 10. 2-dimensional embeddings of 20 Newsgroups data. The data displays some natural clustering, in the information based
embedding, while the PCA embedding does not distinguish between classes. 



For illustration, we will utilize the well known 20 Newsgroups data set¹, which is commonly used
for testing document classification methods. This set contains word counts for postings on 20 separate
newsgroups. We choose to restrict our simulation to the 4 domains with the largest number of sub-
domains (comp.*, rec.*, sci.*, and talk.*), and wish to classify each posting by its highest level domain.
Specifically we are given $\mathcal{P} = \{p_1, \dots, p_N\}$ where each $p_i$ corresponds to a single newsgroup posting
and is estimated with (16). We note that the data was preprocessed to remove all words that occur in 5
or fewer documents².

1) Unsupervised FINE: First, we utilize unsupervised methods to see if a natural geometry exists
between domains. Using Laplacian Eigenmaps on the dissimilarities calculated with the Hellinger distance,
we found an embedding $\mathcal{P} \to \mathbb{R}^2$. Figure 10(a) shows the natural geometric separation between the
different document classes, although there is some overlap (which is to be expected). In contrast, a
Principal Components Analysis (PCA) embedding (Fig. 10(b)) does not demonstrate the same natural
clustering. PCA is often used as a means to lower the dimension of data for learning problems due to its
optimality for Euclidean data. However, the PCA embedding of the 20 Newsgroups set does not exhibit
any natural class separation due to the non-Euclidean nature of the data.

We now compare the classification performance of FINE to that of PCA. In the case of document 
classification, dimensionality reduction is important as the natural dimension (i.e. number of words) for 
the 20 Newsgroups data set is 26,214. Using local intrinsic dimension estimation [33], Fig. 11 shows 

¹ http://people.csail.mit.edu/jrennie/20Newsgroups/
² http://www.cs.uiuc.edu/homes/dengcai2/Data/TextData.html

Fig. 11. Local dimension estimates for each document from a random subset of 4020 documents in the 20 Newsgroups data
set.

Fig. 12. Classification rates for low-dimensional embedding using different methods for dimensionality reduction. 1-standard
deviation confidence intervals shown over 20-fold cross validation.



the histogram of the true dimensionality of the sample documents, so we test performance for
low-dimensional embeddings $\mathcal{P} \to \mathbb{R}^d$ for $d \in [5, 95]$. Following each embedding, we apply an SVM with
a linear kernel to classify the data in an 'all-vs-all' setting (i.e. classify each test sample as one of 4 
different potential classes in a single event, rather than 4 separate binary events). The training and test 
sets were separated according to the recommended indices, and each set was randomly sub-sampled for 
computational purposes, keeping the ratio of training to test samples constant (2413 training samples, 
1607 test samples). Both the FINE and PCA settings jointly embed the training and test sets. 

Figure 12 illustrates that the embedding calculated with FINE outperforms using PCA as a means 
of dimensionality reduction. The classification rates are shown with a 1-standard deviation confidence 



February 14, 2008 



DRAFT 



Fig. 13. 3-dimensional embedding of 20 Newsgroups corpus using FINE in a supervised manner. 

interval, and FINE with a dimension as low as d = 25 generates results comparable to those of a PCA 
embedding with d = 95. To ease any concerns that Laplacian Eigenmaps (LEM) is simply a better 
method for embedding these multinomial PDFs, we calculated an embedding with LEM in which each 
PDF was viewed as a Euclidean vector with the $L_2$-distance used as a dissimilarity metric. This form
of embedding performed much worse than the information based embedding using the same form of 
dimensionality reduction and the same linear kernel SVM, while comparable to the PCA embedding in 
very low dimensions. 

2) Supervised FINE: If we allow FINE to use supervised methods for embedding, we can dramatically 
improve classification performance. By embedding with Classification Constrained Dimensionality Re- 
duction (CCDR) [27], which is essentially LEM with an additional tuning parameter defining the emphasis 
on class labels in the embedding, we are able to get good class separation even in 3 dimensions (Fig. 13). 
We now compare FINE to the diffusion kernels developed by Lafferty and Lebanon [12] for the purpose 
of document classification. The diffusion kernels method uses the full term-frequency representation of 
the data and does not utilize any dimensionality reduction. We stress this difference to determine whether 
or not using FINE for dimensionality reduction can generate comparable results. 

We first illustrate the classification performance in a 'one vs. all' setting, in which all samples from 
a single class were given a positive label (i.e. $1$) and all remaining samples were labeled negatively
(i.e. $-1$). In the FINE setting, we first subsampled from the training and test sets, using a test set size
of 200, then used CCDR to embed the entire data set into $\mathbb{R}^d$, with $d \in [5, 95]$ chosen to maximize
classification performance. The classification task was performed using a simple linear kernel SVM,

$$K(X, Y) = X \cdot Y.$$

TABLE I

Experimental results on 20 Newsgroups corpus, comparing FINE using CCDR and a linear SVM to a
multinomial diffusion kernel based SVM. The performance (classification rate in %) is reported as mean
and standard deviation for different training set sizes $L$, over a 20-fold cross validation.



For the diffusion kernels setting,

$$K(X, Y) = (4\pi t)^{-\frac{n}{2}} \exp\left(-\frac{1}{t} \arccos^2\left(\sqrt{X} \cdot \sqrt{Y}\right)\right),$$

we chose the parameter value $t$ which optimized the classification performance at each iteration. The
experimental results of performance versus training set size, with 20-fold cross validation, are shown in Table
I, where the highest performance at each range is highlighted. FINE shows a significant performance

Fig. 14. Classification rates for low-dimensional embedding with FINE using CCDR vs Diffusion kernels. The classification task
was all v.s. all. Rates are plotted versus number of training samples. Confidence intervals are shown at one standard deviation. 
For comparison to the joint embedding (FINE), we also plot the performance of FINE using out of sample extension (OOS). 



increase over the diffusion kernels method for sets with low sample size. As the sample size increases, 
however, the gap in performance between the diffusion kernels method and FINE decreases, with diffusion 
kernels eventually surpassing FINE. 
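
For concreteness, the two kernels compared in this experiment can be written as short Python functions using only the standard library. This is an illustrative sketch, not the code used to produce Table I: the function names are hypothetical, the inputs to the diffusion kernel are assumed to be multinomial PDFs (non-negative entries summing to one), and the dimension $n$ in the prefactor is passed explicitly because its exact convention is not restated here. Since the prefactor $(4\pi t)^{-n/2}$ is a constant for fixed $t$ and underflows or overflows for vocabularies with tens of thousands of terms, the sketch exposes the kernel without the prefactor and, separately, its logarithm with the prefactor.

```python
import math

def linear_kernel(x, y):
    """K(X, Y) = X . Y on embedded (Euclidean) coordinates."""
    return sum(a * b for a, b in zip(x, y))

def _arccos_sq(p, q):
    # Squared arccosine of the inner product of sqrt(p) and sqrt(q).
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.acos(min(1.0, bc)) ** 2  # clamp guards against rounding above 1

def diffusion_kernel(p, q, t):
    """exp(-arccos^2(sqrt(p) . sqrt(q)) / t), i.e. the kernel WITHOUT the
    constant prefactor (4*pi*t)^(-n/2)."""
    return math.exp(-_arccos_sq(p, q) / t)

def log_diffusion_kernel(p, q, t, n):
    """Logarithm of the full kernel, including the prefactor; n is supplied
    by the caller (assumption: its convention is defined elsewhere)."""
    return -0.5 * n * math.log(4.0 * math.pi * t) - _arccos_sq(p, q) / t
```

Dropping the prefactor rescales every entry of the Gram matrix by the same positive constant, so it can be absorbed into the regularization parameter of the SVM when $t$ is fixed.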

We now modify the classification task from a 'one v.s. all' to an 'all v.s. all' setting, in which each 
class is given a different label and the task is to assign each test sample to a specific class. Classification 
rates are defined as the number of correctly classified test samples divided by the total number of test 
samples (kept constant at 200). The structure of the experiment is otherwise identical to the 'one v.s.
all' setting. We once again notice in Fig. 14 that FINE outperforms the diffusion kernels method for
low sample sizes. The point at which the diffusion kernels method surpasses FINE has decreased (i.e.
$L \approx 200$ for 'all v.s. all' compared to $L \approx 600$ for 'one v.s. all'), yet FINE is still competitive as the
sample size increases. 

While our focus when using FINE has been on jointly embedding both the training and test samples
(while keeping the test samples unlabeled), Fig. 14 also illustrates the use of out of sample extension
(OOS) [34] with FINE. In this scenario, the training samples are embedded as normal with CCDR, while 
the test samples are embedded into the low-dimensional space using interpolation. This setting allows for 
a significant decrease in computational complexity given the fact that the FINE embedding has already 
been determined for the training samples (i.e. new test samples are received). A decrease in performance 
exists when compared to the jointly embedded FINE, which is reduced as the number of training samples 
increases. 

Analysis of the results in both the 'one v.s. all' and 'all v.s. all' cases shows that FINE can improve 
upon the deficiencies of the diffusion kernels method in the low sample size region. By viewing each 

Fig. 15. Comparison of classification performance on the 20 Newsgroups data set with FINE using different SVM kernels;
one linear and two non-linear (2nd degree polynomial and radial basis function).



document as a coarse approximation of the overriding class PDF, it is easy to see that, for low sample 
sizes, the estimate of the within class PDF generated by the diffusion kernels will be highly variable, which 
leads to poor performance. By reducing the dimension with FINE, the variance is limited to significantly 
fewer dimensions, enabling documents within each class to be drawn nearer to one another. While this 
could also bring the classes closer to each other, the utilization of CCDR ensures class separation. This 
results in better classification performance than using the entire multinomial distribution. As the number 
of training samples increases, the effect of dimensionality is reduced, which allows the diffusion kernels 
to better approximate the multinomial PDF representative of each class. This reduction in variance across 
all dimensions ensures that a few anomalous documents will not have the same drastic effect as they 
would in the low sample size region. As such, the performance gain surpasses that of FINE, due to 
the fact that the curse of dimensionality was alleviated elsewhere (i.e. increase in sample size). We note 
that while FINE performs slightly worse than diffusion kernels in the large sample size region, it still 
performs competitively with a leading classification method which utilizes the full dimensional data. 

An additional reason for the diffusion kernels' improved performance over FINE in the large sample
size region is that we have restricted FINE to using a linear kernel for this experiment, while the diffusion 
kernels method is very non-linear. We do this to show that even a simple linear classifier can perform 
admirably in the FINE reduced space. Using a non-linear kernel would show increased performance 
with FINE. This is illustrated in Fig. 15, where we compare the performance of FINE using an SVM 
classifier with a linear kernel ($K(X, Y) = X^T Y$), 2nd degree polynomial kernel ($K(X, Y) = (\gamma X^T Y)^2$),
and a radial basis function kernel ($K(X, Y) = \exp(-\gamma |X - Y|^2)$), where $\gamma$ is a weighting constant.
For visualization purposes, we show the results for only a subset of the training sample range (i.e.
$L \in [200, 400]$), but it is clear that the use of non-linear kernels improves the performance of FINE. The
problem of which of the many possible non-linear kernels is optimal remains open and is a subject for
future work. 
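
The three kernels compared in Fig. 15 can be computed as Gram matrices over embedded points with a few lines of Python and NumPy. The sketch below is illustrative: the function names are hypothetical, the rows of `A` and `B` are assumed to be the low-dimensional coordinates produced by the embedding, and the value of `gamma` is left to the user since no specific value is reported here.

```python
import numpy as np

def linear_gram(A, B):
    """K(X, Y) = X^T Y for every pair of rows of A and B."""
    return A @ B.T

def poly2_gram(A, B, gamma):
    """K(X, Y) = (gamma * X^T Y)^2, the 2nd degree polynomial kernel."""
    return (gamma * (A @ B.T)) ** 2

def rbf_gram(A, B, gamma):
    """K(X, Y) = exp(-gamma * |X - Y|^2), the radial basis function kernel."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off
```

A precomputed Gram matrix of this form is what a kernel SVM consumes during training, so swapping kernels changes only this step and leaves the embedding untouched.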

VI. Conclusions 

The assumption that high-dimensional data lies on a Euclidean manifold is based on the ease of
implementation due to the wealth of knowledge and methods based on Euclidean space. This assumption
is not viable in many problems of practical interest, as there is often no straightforward and meaningful
Euclidean representation of the data. In these situations it is more appropriate to assume the data lies on
a statistical manifold. Using information geometry, we have shown the ability to find a low-dimensional
embedding of the manifold, which allows us to not only find the natural separation of the data, but to
also reconstruct the original manifold and visualize it in a low-dimensional Euclidean space. This allows
the use of many well known learning techniques which work based on the assumption of Euclidean data.

By approximating the Fisher information distance, FINE is able to construct the Euclidean embedding 
with an information based metric, which is more appropriate for non-Euclidean data. We have illustrated 
this approximation by finding the length of the geodesic along the manifold, using approximations such 
as the Kullback-Leibler divergence and the Hellinger distance. The specific metric used to approximate
the Fisher information distance is determined by the problem, and FINE is not tied to any specific
choice of metric. Additionally, we point out that although we utilize kernel methods to obtain PDFs, the
method used for density estimation is only of secondary concern. The primary focus is the measure of 
dissimilarity between densities, and the method used to calculate those PDFs is similarly determined by 
the problem. 

We have illustrated FINE's ability to be used in a variety of learning tasks such as visualization,
clustering, and classification. FINE is a framework that can be used for a multitude of problems which
may seem to have little to nothing in common, such as flow cytometry and document classification. The
only commonality between the problems is that each is based around data which has no straightforward
Euclidean representation, which is the only setting needed to utilize FINE. In future work we plan to
utilize different classification methods (such as $k$-NN and using different SVM kernels) to maximize
our document classification performance. This includes constraining our dimensionality reduction to a
sphere, which will allow the use of diffusion kernels in a low-dimensional space. We also plan to continue
studies on the effect of using out of sample extension on our performance. Lastly, we will continue to
find applications which fit the setting for FINE, such as internet anomaly detection and face recognition,
and determine whether or not these problems would benefit from our framework.